arXiv:1507.08861v1 [cs.MM] 31 Jul 2015


Mobile Multi-View Object 
Image Search 

Fatih Çalışır •

Özgür Ulusoy •

Uğur Güdükbay •

Muhammet Baştan


Abstract The high user interaction capability of mobile devices can
help improve the accuracy of mobile visual search systems. At query
time, it is possible to capture multiple views of an object from
different viewing angles and at different scales with the mobile device
camera to obtain richer information about the object compared to a
single view and hence return more accurate results. Motivated by this,
we developed a mobile multi-view object image search system, using a
client-server architecture. Multi-view images of objects acquired by
the mobile clients are processed and local features are sent to the
server, which combines the query image representations with early/late
fusion methods based on bag-of-visual-words and sends back the query
results. We performed a comprehensive analysis of early and late fusion
approaches using various similarity functions, on an existing single
view and a new multi-view object image database. The experimental
results show that multi-view search provides significantly better
retrieval accuracy compared to single view search.

1 Introduction 

Smart mobile devices have become ubiquitous. They are changing the way
people access information. One traditional way to access information is
via text search, by entering a few keywords as query
(query-by-keyword); but this is usually cumbersome and slow,
considering the small screen size of mobile devices. As a convenience,
it is also possible to initiate text queries via speech, if automatic
speech recognition is available. Sometimes, it is very difficult to
express a query using only keywords. For instance, when a user at a
shoe store wants to know more about a specific type of shoe (e.g.,
cheaper prices elsewhere, customer reviews), she cannot easily
formulate a text query to express her intent. It is much easier to take
a photo of the shoe with the mobile device camera, initiate a visual
search and retrieve visually similar results. This is now possible,
owing to the recent hardware/software developments in mobile device
technology, which turned smartphones with high-resolution cameras,
image processing capabilities and Internet connection into
indispensable personal assistants. This in turn triggered research
interest in mobile visual search and recognition, and motivated the
industry to develop mobile visual search applications, such as Google
Goggles [12], CamFind [5], Nokia Point and Find [2], Amazon Flow [1],
and Kooaba image recognition [25].

Mobile devices have some advantages and disadvantages, compared to
regular PCs. The advantages are higher accessibility, easier user
interaction and the ability to provide context information (e.g.,
location) using extra sensors, like GPS and compass. The disadvantages
are limited computational power, storage, battery life and network
bandwidth, although these are constantly being improved and will be
less of an issue in the future.



Fig. 1 Multi-view images of two different shoes. Online 
stores typically contain multi-view images of the objects. 

The main focus of this work is to leverage the user interaction
potential of mobile devices to achieve higher visual search
performance, and hence, provide the users with easier access to richer
information. When the user wants to search for a specific object, she
can take a photo of the object to initiate a visual search.
Additionally, she can easily tap on the screen to mark the
object-of-interest and provide extra information to the search system
to suppress the influence of background in matching [33]. More
importantly, the user can take multiple photos of the
object-of-interest from different viewing angles and/or at different
scales, thereby providing much more information about the query object.
By multi-view object image search, we mean providing multiple query
images of an object from various viewing angles and at various scales
and combining the query images using early/late fusion strategies to
achieve higher retrieval precision on single- and/or multi-view object
image databases. High precision on mobile retrieval is especially
critical because the screen is too small to display many results, and
more importantly, the user usually does not have much time and patience
to check more than 10 to 20 results.

To illustrate the benefits of multi-view object image search, consider
the multi-view images of two different shoes in Figure 1, taken from
four different viewing angles at the same scale. Such images are
typical on online stores, i.e., multi-view images on clean backgrounds.
Assuming the database contains such multi-view images for each object,
when a user performs a search using a photo that is close to one of the
available views, the results she gets will be better than when the
query image has a completely different view. Intuitively, if the user
takes multiple photos of the object from different viewing angles, the
chance that the query images are similar to the ones in the database
will increase. This is still valid when the database contains single
view images of each object. The effect of multiple scales is similar.
In summary, at query time, the user does not know the view and scale of
the object images in the database; by taking multiple photos from
different views and at different scales, she can increase the chance of
capturing views similar to the database images. This is enabled by the
interactivity of the mobile device.

In this work, we investigate the benefits of multi-view object queries
and databases, and the best ways to fuse the information coming from
multi-view images. Multi-view queries need special treatment to combine
multiple query/database images using early/late fusion strategies [23].
We show through experiments that multi-view images in query and/or
database improve retrieval precision significantly, at the cost of a
modest increase in computation time due to the increase in the number
of images to be processed.

To demonstrate the benefits of multi-view search, we built a mobile
visual search system based on client-server architecture (cf. Figure 2)
and using the well-known bag-of-visual-words (BoW) approach [10,11]. We
performed extensive experiments on both single and multi-view object
image databases with single and multi-view queries using various fusion
strategies and various similarity functions. The major contributions of
this work can be summarized as follows:

— We compare various similarity functions for single view queries on a
publicly available single view object image dataset [26].

— We provide a comprehensive analysis of early and late fusion
strategies for multi-image and multi-view queries on the single view
object image dataset, and compare their performance with the single
view queries.

— We compare the performance of multi-view and single view queries on a
new, multi-view object image dataset, in which each object has
multi-view images. We also compare the running times of single and
multi-view queries.

Fig. 2 Client-server architecture of our mobile multi-view
visual search system.

2 Related Work 

Due to the recent developments in mobile devices with cameras, there
has been a growing interest in mobile visual search, and research works
investigate different aspects of it, such as architectures, power
efficiency, speed, and user interaction. Chen and Girod [7] describe a
mobile product recognition system where the products are CDs, DVDs and
books that have printed labels. The system is local feature based, and
Compressed Histogram of Gradients (CHoG) and Scale-Invariant Feature
Transform (SIFT) are used as local features. Two client-server
architectures are implemented and compared in terms of response time:
one sends images, the other extracts features on the client and sends
the features. The system responded in five seconds when sending
features and in ten seconds when sending images. This means that over
slow connections like 3G it is faster to extract and send features.

Storage space and retrieval speed are important in mobile visual
search. Girod et al. describe a mobile visual search system that adopts
the client-server architecture in which the database is stored on the
phone. The system uses the Bag-of-Words (BoW) approach, and four
different compact database methods are evaluated and compared. Li et
al. [20] propose an on-device mobile visual search system. The system
uses the BoW approach with a small visual dictionary due to the memory
limitation. Additionally, the most useful visual words are selected to
decrease the retrieval time considering the processor limitation. Guan
et al. describe an on-device mobile image search system, which is based
on the bag-of-features (BoF) approach. The system uses approximate
nearest neighbor search to use high dimensional BoF descriptors on the
mobile device with less memory usage. The search system also utilizes
the GPS data from the mobile device to reduce the number of images to
be compared. Zhou et al. [35] describe a mobile image search system
that does not use a codebook, and hence, decreases the memory usage and
quantization errors. The system makes use of SIFT descriptors, and the
dimensionality of the descriptors is reduced using Principal Component
Analysis (PCA).

Mobile devices have high user interaction potential; this has been
utilized for better retrieval. Joint Search with Image Speech and Words
(JIGSAW) [30] is a mobile visual search system that provides multimodal
queries. This system allows the user to speak a sentence and performs
text-based image search. The user selects one or more images from the
result set to construct a visual query for content-based image search.
In [26], a mobile product image search system that automatically
extracts the object in the query image is proposed. From the top n
images that have a clean background, object masks are found. The object
in the query image is then extracted by using a weighted mask approach
and its background is cleaned. The cleaned query image is finally used
to perform image search. Extracting the object-of-interest and
performing the query with a clean background is shown to work better.
Similarly, TapTell [33] is an interactive mobile visual search system,
in which users take a photo and indicate an object-of-interest within
the photo using various drawing patterns.

There are several mobile image search applications available on the
mobile market. Point&Find [23] allows users to point the camera to the
scene or object and get information about it. Kooaba is a
domain-specific search system whose target domains are books, DVD and
game covers [25]. Digimarc Discover [9] is similar to Point&Find; the
user points the camera to an object and gets information about it.
PlinkArt [8] is another domain-specific mobile search system whose
target domain is well-known artworks. The user takes a photo of a
well-known artwork and gets information about it. One of the latest
mobile search applications is CamFind [5], which is a general object
search system. When the user takes a photo of a scene, products are
identified and similar objects are listed as a result. Another recent
mobile search application is Amazon Flow [1]; the user points the
camera to the object and receives information about it.

Multi-image queries have been used to improve image retrieval.
Arandjelovic and Zisserman [3] propose an object retrieval system using
multiple image queries. The user enters a textual query and Google
image search is performed using this textual query. The top eight
images are then retrieved and used as query images. Early and late
fusion methods are applied. Tang and Acton [28] propose a system that
extracts different features from different query images. These
extracted features are then combined and used as the features of the
final query image. Another system is proposed in [22], which allows
users to select different regions of interest in the image. Each region
is then treated as a query and the results are combined. Zhang et al.
[32] describe a similar system, which also uses regions; however, these
regions are extracted automatically and users select regions from the
extracted parts. Xue et al. propose a system that uses multiple image
queries to reduce the distracting features by using a hierarchical
vocabulary tree. The system focuses on the parts that are common in all
the query images. The multi query system described in [19] uses early
fusion; each database image is compared with each query image and each
query image gets a weight according to the similarity between the query
image and the database image.

Queries using multiple images have been utilized in several existing
works to improve the performance of visual search systems. In this
work, we take this one step further and leverage the interaction
capability of mobile devices to perform more informative visual object
queries by taking multi-view images of the query object. We performed a
comprehensive analysis of various similarity functions and early/late
fusion methods in terms of retrieval precision and running time.
Furthermore, we collected a new multi-view object image dataset to
perform single and multi-view query experiments.

3 Proposed Mobile Visual Search System 

The proposed mobile multi-view visual search system is based on the
well-known BoW approach: the images are represented as a histogram of
quantized local features, called the BoW histogram. First, interest
points are detected on images; the points are described with local
descriptors computed around the points. Then, a vocabulary (dictionary)
is constructed with a set of descriptors from the database, typically
using the k-means clustering algorithm. Finally, images are represented
as a histogram of local interest point descriptors quantized on the
vocabulary. When a query image is received by the search system, local
features are extracted and the BoW histogram is computed. The query
histogram is compared with the histograms stored in the database, and
the best k results are returned to the user (cf. Figure 2).
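The quantization step described above can be illustrated with a minimal sketch. It assumes that a vocabulary (the k-means cluster centers) is already available and that nearest-neighbor assignment uses the Euclidean distance; the function names `quantize` and `bow_histogram` and the toy 2-D descriptors are hypothetical and are not taken from the system described here.

```python
from math import dist  # Euclidean distance, Python 3.8+

def quantize(descriptor, vocabulary):
    """Index of the visual word (cluster center) nearest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda w: dist(descriptor, vocabulary[w]))

def bow_histogram(descriptors, vocabulary):
    """Count how many local descriptors are assigned to each visual word."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[quantize(d, vocabulary)] += 1
    return hist

# Toy data: 2-D stand-ins for descriptors (real SIFT descriptors are 128-D)
# and a 3-word vocabulary. Both are illustrative assumptions.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.8), (0.2, 0.1)]

hist = bow_histogram(descriptors, vocabulary)
print(hist)  # [2, 1, 1]
```

A brute-force nearest-word search like this is only practical for small vocabularies; a real system would likely use an approximate nearest neighbor index.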

Local features are key to the performance of the search system. In a
mobile system, they should also be efficiently computable. To this end,
we employed two fast local feature detectors: Harris and Speeded Up
Robust Features (SURF) [29]. They detect two types of complementary
local structures on images: corners and blobs. Using complementary
interest points is useful for improving the expressive power of
features and hence the retrieval precision. The detected points are
represented with the SIFT descriptor. The BoW histograms are computed
for Harris and SURF separately, and then they are concatenated to
obtain the BoW histogram of an image.

To rank the database images, they need to be compared with the query
image, based on the BoW histograms. It is crucial to select the right
similarity functions for high retrieval precision and low computational
cost. There are various similarity functions that can be used to
compare histograms [6,18,21]. We experimented with the similarity
functions given in Table 1 and present a comparison in terms of
retrieval precision and running time in Section 5. In the table, $h_q$
and $h_d$ represent the histograms of the query and database images,
respectively. In the formulas, $q_i$ and $d_i$ are the $i$-th histogram
bins of the query and database histograms, respectively.


3.1 Multi-View Search 

Image databases typically contain single view images of objects or
scenes, as in Figure 3. At query time, if the user captures and uses a
view close to the one in the database, she will retrieve the relevant
image, but the user does not have any idea about the view stored in the
database. If the query image has a slightly different view or scale,
the invariance of local features can handle such view/scale changes;
but if the view/scale change is significant, the system will most
probably fail. As a solution, the user may take multiple photos from
different viewing angles and at different scales to increase the chance
of providing query images similar to the database images. Moreover, if
the database images are also multi-view, we can expect to get even
better results. Hence, both the query and database images can be
multi-view, each object/scene having multi-view images, as in Figure 4.
In the most general case, the query may contain M > 1 images of an
object and the database may consist of N > 1 images of each object.



Fig. 3 Single view images: each image is a typical, single 
view of an object. 



Fig. 4 Multi-view images: each object has multiple images 
from different viewing angles (and/or at different scales). 


Single-view query and single-view database (M = 1, N = 1): Both the
query and database objects have a single image that represents a
specific view of the object, as in Figure 3. During retrieval, the
query image is compared to every database image using a similarity
function (cf. Table 1) to find the best k matches.

Single-view query and multi-view database (M = 1, N > 1): The query has
a single view (cf. Figure 3) and database objects have multi-view
images (cf. Figure 4). During retrieval, early/late fusion methods (cf.
Sections 3.1.1 and 3.1.2) are employed to find and return the best k
matching database objects.

Multi-view query and single-view database (M > 1, N = 1): The query has
multi-view images, the database has a single image for each object.
During retrieval, early/late fusion methods are employed to find and
return the best k matching database images.

Multi-view query and multi-view database (M > 1, N > 1): Both the query
and database objects have multi-view images. This is the most general
case and comprises the previous three cases. During retrieval,
early/late fusion methods are employed to find and return the best k
matching database objects. We expect to get the best retrieval
precision, but at an increased computational cost.

Table 1 Similarity functions used for comparing the BoW histograms.

| Similarity Function | Symbol | Formula |
|---|---|---|
| Dot Product [18] | $\mathrm{dot}(h_q, h_d)$ | $\sum_i q_i d_i$ |
| Histogram Intersection [21] | $\mathrm{HI}(h_q, h_d)$ | $\frac{\sum_i \min(q_i, d_i)}{\min(\lvert h_q \rvert, \lvert h_d \rvert)}$ |
| Normalized Histogram Intersection [18] | $\mathrm{NHI}(h_q, h_d)$ | $\sum_i \min\left(\frac{q_i}{\sum_i q_i}, \frac{d_i}{\sum_i d_i}\right)$ |
| Normalized Correlation [21] | $\mathrm{NC}(h_q, h_d)$ | $\frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \times \sqrt{\sum_i d_i^2}}$ |
| Min-Max Ratio [6] | $\mathrm{MinMax}(h_q, h_d)$ | $\frac{\sum_i \min(q_i, d_i)}{\sum_i \max(q_i, d_i)}$ |
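The similarity functions of Table 1 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used in the experiments: the function names are hypothetical, $\lvert h \rvert$ is assumed to denote the sum of the histogram bins, the Normalized Correlation is assumed to be normalized by the Euclidean norms of the two histograms, and both histograms are assumed to be non-empty with at least one non-zero bin.

```python
from math import sqrt

def dot(q, d):
    """Dot Product: sum of bin-wise products."""
    return sum(qi * di for qi, di in zip(q, d))

def hist_intersection(q, d):
    """Histogram Intersection, normalized by the smaller histogram mass
    (|h| taken as the sum of bins, an assumption)."""
    return sum(min(qi, di) for qi, di in zip(q, d)) / min(sum(q), sum(d))

def norm_hist_intersection(q, d):
    """Intersection of the two histograms after each is scaled to sum to 1."""
    sq, sd = sum(q), sum(d)
    return sum(min(qi / sq, di / sd) for qi, di in zip(q, d))

def norm_correlation(q, d):
    """Dot product divided by the product of the Euclidean norms (assumed form)."""
    return dot(q, d) / (sqrt(dot(q, q)) * sqrt(dot(d, d)))

def min_max_ratio(q, d):
    """Sum of bin-wise minima divided by sum of bin-wise maxima."""
    return (sum(min(qi, di) for qi, di in zip(q, d))
            / sum(max(qi, di) for qi, di in zip(q, d)))

# Toy query and database histograms (illustrative values only).
h_q = [2, 1, 1, 0]
h_d = [1, 1, 0, 2]

print(dot(h_q, h_d))                     # 3
print(hist_intersection(h_q, h_d))       # 0.5
print(norm_hist_intersection(h_q, h_d))  # 0.5
print(norm_correlation(h_q, h_d))        # approximately 0.5
print(min_max_ratio(h_q, h_d))           # approximately 0.333
```

All five functions make a single pass over the bins, so for histograms of equal length their costs should differ only by small constant factors.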

When the query or database objects have multiple images, we must employ
fusion methods to process the queries and find the best k matching
database objects. This is one of the crucial steps to achieve high
retrieval performance. There are mainly two types of fusion methods:
early fusion and late fusion. We performed a comprehensive experimental
analysis of several early and late fusion methods.


3.1.1 Early Fusion 

Early fusion, also referred to as fusion in feature space, is the
approach in which the BoW histograms of multiple images are combined
into a single histogram and the similarity function is applied on the
combined histograms. We used the early fusion methods given in Table 2
[3]. In the table, the histograms for M images are combined into $h^c$;
$h_i^j$ is the $i$-th bin of histogram $h^j$ of image $j$.

Table 2 Early fusion methods.

| Method | Formula |
|---|---|
| Sum Histogram | $h_i^c = \sum_{j=1}^{M} h_i^j$ |
| Average Histogram | $h_i^c = \frac{\sum_{j=1}^{M} h_i^j}{M}$ |
| Maximum Histogram | $h_i^c = \max(h_i^1, \ldots, h_i^M)$ |
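The three early fusion methods of Table 2 amount to bin-wise reductions over the M histograms. The following minimal sketch assumes that all histograms have the same length; the function names and the toy histograms are hypothetical.

```python
def sum_histogram(hists):
    """Bin-wise sum of the M histograms."""
    return [sum(bins) for bins in zip(*hists)]

def average_histogram(hists):
    """Bin-wise mean of the M histograms."""
    m = len(hists)
    return [sum(bins) / m for bins in zip(*hists)]

def max_histogram(hists):
    """Bin-wise maximum of the M histograms."""
    return [max(bins) for bins in zip(*hists)]

# Three toy BoW histograms, one per view of the same object (illustrative).
views = [
    [2, 0, 1],
    [0, 4, 1],
    [1, 2, 1],
]

print(sum_histogram(views))      # [3, 6, 3]
print(average_histogram(views))  # [1.0, 2.0, 1.0]
print(max_histogram(views))      # [2, 4, 1]
```

The combined histogram can then be compared to a database histogram with any single similarity function, so early fusion needs only one comparison per database entry regardless of M.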


3.1.2 Late Fusion 

Late fusion, also referred to as decision level fusion, considers each
query and database image separately to obtain similarity scores between
the query and database images using their BoW histograms; the final
result list is obtained by combining the individual similarity scores.
This can be done in two ways: (1) image similarity and ranking and (2)
image set similarity and ranking.

Image similarity and ranking. The image histograms in the query are
compared to all the image histograms of all the objects in the
database; a single result list is obtained by ranking the database
objects based on similarity scores or ranks. We used the following
methods [19,36].

— Max Similarity (MAX SIM). Each database image is compared with the
query images and the similarity is taken as the maximum of the
similarities.

— Weighted Similarity. Each database image is ranked according to a
weighted similarity to the query images.

— Count. For multiple query images, multiple result lists are obtained.
Then, for each image, a counter is incremented if it is in a list.
Finally, the counter value is used to rank the database images (higher
value, higher rank).

— Highest Rank. For multiple query images, multiple result lists are
obtained and the highest rank is taken for each database image.

— Rank Sum. For multiple query images, multiple result lists are
obtained and the ranking of each image in every list is summed and the
resulting values are used to rank the database images.
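Four of the methods above (Max Similarity, Count, Highest Rank and Rank Sum) can be sketched as follows. The sketch makes several assumptions that are not stated in the text: `sims[i][j]` holds the similarity between query image i and database image j, ranks are 1-based with rank 1 being the best match, and the per-query result lists used by Count are truncated to the top k entries. Weighted Similarity is omitted because the weighting scheme is not specified here. All names are hypothetical.

```python
def ranking(scores):
    """Database image indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def rank_positions(scores):
    """1-based rank of every database image for one query image."""
    pos = [0] * len(scores)
    for r, j in enumerate(ranking(scores), start=1):
        pos[j] = r
    return pos

def max_sim(sims):
    """Max Similarity: best score of each database image over all query images."""
    return [max(col) for col in zip(*sims)]

def count_fusion(sims, k):
    """Count: number of per-query top-k lists that contain each database image."""
    counts = [0] * len(sims[0])
    for scores in sims:
        for j in ranking(scores)[:k]:
            counts[j] += 1
    return counts

def highest_rank(sims):
    """Highest Rank: best (smallest) rank of each image over all result lists."""
    return [min(p) for p in zip(*(rank_positions(s) for s in sims))]

def rank_sum(sims):
    """Rank Sum: sum of the ranks of each image (a smaller sum is better)."""
    return [sum(p) for p in zip(*(rank_positions(s) for s in sims))]

# Two query images (rows) against three database images (columns).
sims = [
    [0.9, 0.2, 0.5],
    [0.4, 0.3, 0.8],
]

print(max_sim(sims))          # [0.9, 0.3, 0.8]
print(count_fusion(sims, 2))  # [2, 0, 2]
print(highest_rank(sims))     # [1, 3, 1]
print(rank_sum(sims))         # [3, 6, 3]
```

Note the different orderings: for Max Similarity and Count a larger value is better, while for Highest Rank and Rank Sum a smaller value is better.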

Image set similarity and ranking. First, the similarity scores between
M images of the query object and N database images of each object are
computed, resulting in M × N similarity scores, as shown in Figure 5.
Then, an image set similarity score between the query object and each
database object is computed, and finally, database objects are ranked
according to the image set similarity scores.

Fig. 5 Similarity computation between image sets. The
query has M images, the database object has N images. A
similarity score $S_{ij}$ is computed between every query image
i and every database object image j, resulting in M × N
similarity scores.

The image set similarity scores between M query images and N database
object images are computed in one of the following ways, based on the
individual similarity scores between the query and database images (cf.
Figure 5).

— Maximum Similarity (MAX). The similarity score is the maximum of all
M × N similarity scores.

$$\mathrm{Similarity} = \max(S_{ij})$$

If at least one of the query images is very similar to one of the
database object images, this measure will return good results.

— Average Similarity (AVERAGE). The similarity score is computed as the
average of all M × N similarity scores.

$$\mathrm{Similarity} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} S_{ij}}{M \times N}$$

The average operator reduces the effects of outliers, but it also
reduces the effects of good matches with high similarity scores.

— Weighted Average Similarity (WEIGHTED AVERAGE). To promote the
influence of good matches with high similarity scores, a weight is
assigned to each score.

$$w_{ij} = \frac{S_{ij}}{\sum_{i=1}^{M} \sum_{j=1}^{N} S_{ij}}$$

$$\mathrm{Similarity} = \sum_{i=1}^{M} \sum_{j=1}^{N} S_{ij} \times w_{ij}$$

— Average of Maximum Similarities (AVERAGE MAX). First, the maximum
similarity of each of the M query images to the N database object
images is computed. Then, the average of the M maximum similarity
values is computed as the image set similarity.

$$\mathrm{Similarity} = \frac{\sum_{i=1}^{M} \max(S_{i1}, \ldots, S_{iN})}{M}$$

— Weighted Average of Maximum Similarities (WEIGHTED AVERAGE MAX). This
is similar to the previous method; this time, the average is weighted,
with weights $w_i$ defined analogously to $w_{ij}$ in the Weighted
Average Similarity.

$$S_i = \max(S_{i1}, \ldots, S_{iN})$$

$$\mathrm{Similarity} = \sum_{i=1}^{M} w_i \times S_i$$
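The five image set similarity measures can be sketched over an M × N score matrix as below. This is a sketch under stated assumptions: `S[i][j]` is the similarity between query image i and database object image j, each weight is assumed to be the score divided by the sum of all scores being combined (so that high scores count more), and the scores are assumed to be non-negative with a non-zero sum. The function names are hypothetical.

```python
def max_similarity(S):
    """MAX: largest of the M x N scores."""
    return max(max(row) for row in S)

def average_similarity(S):
    """AVERAGE: mean of the M x N scores."""
    return sum(map(sum, S)) / (len(S) * len(S[0]))

def weighted_average(S):
    """WEIGHTED AVERAGE: each score weighted by its share of the total
    (assumed weighting w_ij = S_ij / sum of all scores)."""
    total = sum(map(sum, S))
    return sum(s * (s / total) for row in S for s in row)

def average_max(S):
    """AVERAGE MAX: mean over query images of their best score."""
    return sum(max(row) for row in S) / len(S)

def weighted_average_max(S):
    """WEIGHTED AVERAGE MAX: per-query maxima, weighted by their share of
    the sum of maxima (assumed weighting w_i = S_i / sum of maxima)."""
    maxima = [max(row) for row in S]
    total = sum(maxima)
    return sum(s * (s / total) for s in maxima)

# M = 2 query images (rows), N = 2 images of one database object (columns).
S = [
    [0.2, 0.8],
    [0.4, 0.6],
]

print(max_similarity(S))        # 0.8
print(average_similarity(S))    # 0.5
print(weighted_average(S))      # approximately 0.6
print(average_max(S))           # approximately 0.7
print(weighted_average_max(S))  # approximately 0.714
```

On this toy matrix the weighted variants score higher than their unweighted counterparts, which reflects the intent of promoting good matches.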


3.2 Speeding up Multi-View Query Processing 

Multi-view queries are inherently computationally more expensive than
single view queries. However, it is possible to speed up the multi-view
search in a mobile multi-view search setting. As the user is taking
multiple photos of the query object, the feature extraction and query
processing can run in parallel in the background. This is possible
because current mobile devices usually have multi-core processors.
While one thread handles photo-taking, another thread can extract and
send features to the server, which can start query processing as soon
as it receives the features for the first query image. Figure 6 shows
the flow diagram of the whole process as implemented in our mobile
search system.
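The overlap between photo-taking and feature extraction can be illustrated with a simple producer-consumer sketch. It is not the client implementation: `extract_features` is a hypothetical stand-in for local feature extraction, appending to the `sent` list stands in for the network transfer to the server, and the photo names are placeholders.

```python
import queue
import threading

def extract_features(photo):
    """Hypothetical stand-in for local feature extraction on the client."""
    return f"features({photo})"

def capture(photos, q):
    """Producer: the user takes photos one after another."""
    for photo in photos:
        q.put(photo)
    q.put(None)  # sentinel: the user has finished taking photos

def process(q, sent):
    """Consumer: extract features and 'send' them as soon as a photo arrives."""
    while True:
        photo = q.get()
        if photo is None:
            break
        sent.append(extract_features(photo))  # stands in for sending to server

photos = ["view1", "view2", "view3"]
q = queue.Queue()
sent = []

worker = threading.Thread(target=process, args=(q, sent))
worker.start()        # background thread starts waiting for photos
capture(photos, q)    # foreground: photo-taking continues meanwhile
worker.join()

print(sent)  # ['features(view1)', 'features(view2)', 'features(view3)']
```

With a single consumer and a FIFO queue, features are sent in capture order, so the server can begin matching the first view while later views are still being taken.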


4 Datasets 

We used two different datasets to evaluate the performance of our
mobile search system: (i) an existing single view mobile product image
search dataset, Caltech-256 Mobile Product Search Dataset [26], and
(ii) a new multi-view object image dataset we constructed for this
work.

(i) Caltech-256 Mobile Product Search Dataset. This is a subset of the
Caltech-256 Object Category Dataset, which is used to evaluate the
performance of the mobile product search system described in [26]. The
dataset has 20 categories and 844 object images with clean background;
objects are positioned at the image center. There are 60 query images
from six categories; query images contain background clutter. The
original Caltech-256 dataset images were downloaded from Google Images.
Figure 7 shows sample images from the dataset. This is a single view
object image dataset. Although the dataset contains multiple images of
each category, the images are not multiple views of the same object;
rather, they are from different objects of the same category.

Fig. 7 Sample images from the Caltech-256 mobile product
search dataset.

(ii) Multi-View Object Images Dataset. The main focus of this work is
mobile multi-view object image search; hence it is crucial to have a
suitable multi-view object image dataset to evaluate the performance of
our system. To the best of our knowledge, such a dataset is not
available. We constructed a new dataset from the Internet: Multi-view
Object Image Dataset (MVOD). Each object has at least two different
images taken from different views. The MVOD dataset has six categories
and 1664 object images with a clean background; objects are positioned
at the image center. There are 30 query images from four categories,
taken by a mobile phone (Samsung I9300 with an 8 MP built-in camera).
The dataset is suitable for a mobile product search system, containing
images of daily life items sold on online stores, and hence, it is easy
to generate query images with a mobile device. Figure 8 shows sample
images from the dataset. The dataset is available at
www.cs.bilkent.edu.tr/~bilmdg/mvod/.

Fig. 8 Sample images from the MVOD dataset.


5 Experiments 

We performed extensive experiments on the Caltech-256 and MVOD datasets
and evaluated the performance of various similarity functions and
fusion methods. We used the OpenCV library to extract the local
features (Harris, SURF detector with SIFT descriptor). The vocabulary
size is 3000.

We evaluated the proposed image search system in terms of average
precision (AveP) [15] and computational cost. The average precision is
calculated as shown below. In the equations, k represents the rank in
the sequence and N is the length of the result list.

$$P(k) = \frac{\lvert \text{relevant images} \cap \text{first } k \text{ images} \rvert}{k}$$

$$rel(k) = \begin{cases} 1 & \text{if image } k \text{ is relevant} \\ 0 & \text{otherwise} \end{cases}$$

$$AveP = \frac{\sum_{k=1}^{N} \left( P(k) \times rel(k) \right)}{N}$$

5.1 Results on the Caltech-256 Dataset

As mentioned above, the Caltech-256 mobile product search dataset is a
single-view dataset. On this dataset, we performed experiments with
three types of queries:

— Single view queries. Each query is a single object image; the same
query images as in [26] are used (six categories, each having ten
images). Queries with clean and cluttered background are performed and
evaluated separately. We used the clean background queries provided by
[26]. They were obtained by segmenting out the objects from the
background.

— Multi-image queries. Each query consists of multiple object images
from the same category; however, the images belong to different
objects and are not multiple views of the same object. There are six
queries for six categories, and all ten images are used in each
multi-image query.

— Multi-view queries. Each query consists of multi-view images of an
object; the images were taken with a mobile phone and hence are not
from the Caltech-256 dataset. There are four multi-view queries for
four categories, each having five images.

Figure 9 shows the average precision graphs for single view queries
using various similarity functions for our system. The similarity
functions Min-Max Ratio, Normalized Histogram Intersection and
Normalized Correlation work much better than Dot Product and Histogram
Intersection on both clean and background cluttered queries. As
expected, the average precision is higher for queries with a clean
background. When we compare our results with those presented in Figure
6 (b) of [26], our average precision values are 0.1 to 0.15 higher than
[26], probably due to the multiple complementary features (Harris+SURF
with SIFT) we used. Figure 10 shows single view query examples with two
different similarity functions.

Figure 11 shows the average precision graphs for multi-image queries
using various fusion methods and the Min-Max Ratio similarity function.
The late fusion methods Rank Sum and Count work better than the other
early and late fusion methods. The average precision values are about
0.25 higher on background cluttered queries, and 0.1 higher on clean
background queries, compared to the single view queries. Figures 12 and
13 show sample queries.
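The average precision measure used in these experiments can be sketched as follows. The sketch follows the description given at the start of Section 5, where N is the length of the result list, so the sum is divided by N; other definitions normalize by the number of relevant images instead. The function name and the toy relevance list are hypothetical.

```python
def average_precision(relevance):
    """AveP over a ranked result list.

    relevance[k-1] is 1 if the result at rank k is relevant and 0 otherwise.
    P(k) is the fraction of relevant images among the first k results;
    the sum of P(k) * rel(k) is divided by the list length N (assumed
    normalization, following the definition in Section 5).
    """
    n = len(relevance)
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        total += (hits / k) * rel
    return total / n

# Toy result list of length 4: relevant images at ranks 1 and 3.
ranked = [1, 0, 1, 0]
print(average_precision(ranked))  # approximately 0.417, i.e. (1 + 2/3) / 4
```

Because relevant results at early ranks contribute larger P(k) values, a list with the same relevant images placed lower would score less.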

Figures 14 and 15 show the average precision graphs and a sample query
for multi-view queries using various fusion methods and the Min-Max
Ratio similarity function. As explained above, the multi-view query
images of objects were taken with a mobile phone on a clean background.
The late fusion methods Rank Sum, Weighted Similarity and Count work
better than the other early and late fusion methods. Multi-view queries
improve the average precision further compared to multi-image queries,
since the query images are multiple views of a single object, providing
a better representation of the query object.


5.2 Results on the MVOD Dataset 

As mentioned above, MVOD is a multi-view dataset we prepared to
evaluate the performance of our mobile search system on multi-view
object image databases. Since this is a completely different dataset,
the results are not directly comparable to those of Caltech-256. On
this dataset, we performed experiments with two types of queries:

— Single view queries. Each query is a single object image, selected
from the multi-view queries.

— Multi-view queries. Each query consists of multi-view images of an
object; the images were taken with a mobile phone.

Figures 16 and 17 show the average precision graphs
for single view and multi-view queries, respectively. As
expected, multi-view queries provide a significant
improvement in average precision (about +0.2) over
single view queries. The improvement is larger on
background cluttered queries, which is important because,
in a real-world setting, the query images will usually have
background clutter. On the other hand, the average
precision for queries with a clean background is always
higher than for queries with a cluttered background. It is
possible to reduce the influence of the background by
segmenting out the objects automatically, as in [26], or
semi-automatically if the user can quickly tap on the
screen and select the object of interest, as in [33].

Among the fusion methods, the late fusion methods
Weighted Average of Maximum Similarities and Maximum
Similarity, and the early fusion method Maximum
Histogram, work better than the other methods.
Sample queries in Figures 18, 19 and 20 demonstrate
the improvement in the result lists. Multi-view queries
provide significant improvement over single-view queries,
with both clean and background cluttered query images.
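
The difference between the early and late fusion methods named above can be sketched on BoW histograms as follows. The sketch assumes that the Min-Max Ratio similarity is the sum of bin-wise minima divided by the sum of bin-wise maxima, that Maximum Histogram takes the bin-wise maximum of the query histograms before matching, and that Maximum Similarity keeps the best similarity over the query views. The function names and these definitions are assumptions for illustration; the weighted variants are omitted because their weights are not restated here.

```python
def min_max_ratio(h, g):
    """Assumed Min-Max Ratio: sum of bin-wise minima over sum of maxima."""
    denom = sum(max(a, b) for a, b in zip(h, g))
    return sum(min(a, b) for a, b in zip(h, g)) / denom if denom else 0.0


def maximum_histogram(query_hists):
    """Early fusion: merge the query histograms by bin-wise maximum."""
    return [max(bins) for bins in zip(*query_hists)]


def early_fusion_score(query_hists, db_hist):
    """Match one fused query histogram against a database histogram."""
    return min_max_ratio(maximum_histogram(query_hists), db_hist)


def maximum_similarity(query_hists, db_hist):
    """Late fusion: match each view separately, keep the best score."""
    return max(min_max_ratio(q, db_hist) for q in query_hists)
```

Early fusion performs a single comparison per database image, whereas late fusion performs one comparison per query view and then combines the scores.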


5.3 Running Time Analysis 

Multi-view queries are inherently computationally
expensive. In this section, we compare the running times
of single and multi-view query methods for different
similarity functions. To do so, we measured the time
spent matching the images on the server side; this
includes vector quantization, BoW histogram
construction, similarity computation, fusion and ranking.
The measurements were made on the MVOD dataset, with
five queries each having five images, and the reported
duration is the average over all queries in each query
category. Table 3 summarizes the results. According to
the table, the matching times for the similarity
functions are close to each other. The increase in running
time for multi-view queries is not proportional to the
number of images in the query and database; it is lower
(due to varying image content and different numbers of
interest points detected in each image). Based on the
running times and the average precision performances,
the late fusion methods Weighted Average of Maximum
Similarities and Maximum Similarity, and the early
fusion method Maximum Histogram, can be employed for
multi-view object image search.
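
As a quick check of the sub-proportional growth in running time, the Min-Max Ratio column of Table 3 can be compared against the single view baseline. The snippet below is only illustrative arithmetic on the reported numbers; the variable names are chosen for this example and are not part of the system.

```python
# Running times (ms) from Table 3, Min-Max Ratio column.
single_view_ms = 332
multi_view_ms = {
    "Sum Histogram": 982,
    "Average Histogram": 913,
    "Maximum Histogram": 1010,
    "Average Similarity": 1132,
    "Weighted Average Similarity": 1138,
    "Maximum Similarity": 1128,
    "Average of Max Similarity": 1136,
    "Weighted Average of Max Similarity": 1130,
}
images_per_query = 5  # each multi-view query has five images

# Slowdown of each fusion method relative to a single view query.
slowdown = {name: ms / single_view_ms for name, ms in multi_view_ms.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x (query has {images_per_query} images)")
```

With these numbers every fusion method stays well below a fivefold slowdown, which matches the observation that the cost does not scale linearly with the number of query images.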


6 Conclusions 

We investigated the performance of single view,
multi-image and multi-view object image queries on both
single view and multi-view object image databases using
various similarity functions and early/late fusion
methods. We conclude that multiple view images, both in the
queries and in the database, significantly improve the
retrieval precision. Mobile devices with built-in cameras
can leverage their user interaction capability to enable
multi-view queries. The performance can be further
improved if the query objects are isolated from the
background. This can be done automatically, as in [26], or
the user can tap on the screen and select the object of
interest in the query image, as in [33]. We developed a
mobile search system and evaluated it on two datasets,
both suitable for mobile product search, which is one
of the useful application areas of such mobile interactive
search systems.

Table 3 Running times (ms) of similarity functions and fusion methods.


                                                       Similarity Functions
                                    Normalized   Histogram     Normalized Histogram  Dot      Min-Max
Query / Fusion Method               Correlation  Intersection  Intersection          Product  Ratio

Single View Query (No Fusion)       317          343           272                   277      332

Early Fusion
Sum Histogram                       1122         963           947                   899      982
Average Histogram                   911          925           914                   902      913
Maximum Histogram                   962          1126          965                   988      1010

Late Fusion
Average Similarity                  1069         1340          877                   1242     1132
Weighted Average Similarity         1080         1323          899                   1248     1138
Maximum Similarity                  1092         1327          863                   1232     1128
Average of Max Similarity           1093         1346          858                   1243     1136
Weighted Average of Max Similarity  1118         1355          849                   1299     1130



[Plot legend: Normalized Correlation, Histogram Intersection, Normalized Histogram Intersection, Dot Product, Min-Max Ratio]


Fig. 9 Single view query average precision graphs on Caltech-256 dataset with various similarity functions: (a) our results 
with background cluttered queries and (b) our results with clean background queries. 



Fig. 10 Single view query examples on Caltech-256 dataset with two similarity functions: (a) Min-Max Ratio and
(b) Normalized Histogram Intersection.



Fig. 11 Multi-image query average precision graphs on Caltech-256 dataset with various early and late fusion methods and
the Min-Max Ratio similarity function: (a) background cluttered queries, and (b) clean background queries.

Fig. 12 Single view and multi-image query examples on Caltech-256 dataset: (a) single view query, (b) multi-image query
with Rank Sum late fusion method, and (c) multi-image query with Count late fusion method. The Min-Max Ratio similarity
function is used.



Fig. 13 Multi-image query examples on Caltech-256 dataset with early fusion: (a) Average Histogram, and (b) Weighted
Average Histogram. The Min-Max Ratio similarity function is used.

Fig. 14 Multi-view query average precision graph on Caltech-256 dataset with various early and late fusion methods. The
Min-Max Ratio similarity function is used. The multi-view query images were taken with a mobile phone.

Fig. 15 Multi-view query example on Caltech-256 dataset with Rank Sum fusion and Min-Max Ratio similarity function.
The query images were taken with a mobile phone.




[Plot legend: Normalized Correlation, Histogram Intersection, Normalized Histogram Intersection, Dot Product, Min-Max Ratio]


Fig. 16 Single view query average precision graphs on the MVOD dataset with various similarity functions: (a) background 
cluttered queries and (b) clean background queries. 



Fig. 17 Multi-view query average precision graphs on MVOD dataset with various early and late fusion methods:
(a) background cluttered queries, (b) clean background queries. The Min-Max Ratio similarity function is used.

Fig. 18 Single view and multi-view query examples on MVOD dataset: (a) single view query, (b) multi-view query with Max
late fusion method and (c) multi-view query with the Weighted Average Max late fusion method.

Fig. 19 Single view and multi-view query examples on MVOD dataset: (a) single view query, (b) multi-view query with Max
late fusion method and (c) multi-view query with the Weighted Average Max late fusion method.

Fig. 20 Single view and multi-view query examples on MVOD dataset: (a) single view query, (b) multi-view query with Max
late fusion method and (c) multi-view query with the Weighted Average Max late fusion method.